Aspect-based sentiment classification (ABSC) is vital in helping manufacturers identify the pros
and cons of their products and features. Interest in ABSC has surged recently because it predicts
the sentiment polarity of an aspect term in a sentence rather than that of the whole sentence.
Most existing methods use recurrent neural networks and attention mechanisms, which fail to
capture global dependencies of the input sequence and therefore lose information. Other methods
use sequence models, which are tedious to train. Here, we propose the multi-head attention
transformation (MHAT) network, which uses a transformer encoder to reduce training time for ABSC
tasks. First, we use pre-trained global vectors for word representation (GloVe) for word and
aspect term embeddings. Second, part-of-speech (POS) features are fused with MHAT to extract
grammatical aspects of an input sentence, which most existing methods neglect. On the SemEval
2014 dataset, the proposed model consistently outperforms the state-of-the-art methods on
aspect-based sentiment classification tasks.

Keywords: Aspect, Attention, Parts-of-speech, Sentiment, Transformer


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Abhinandan P. Shirahatti 

Department of CSE, KLE DR MSSCET, affiliated to VTU Belagavi 
R.C Nagar 2nd Stage Belagavi Karnataka, India 

Email: abhinandans2010@gmail.com


1. INTRODUCTION 

Natural language processing (NLP) is a domain at the intersection of artificial intelligence,
computer science, and linguistics. Its main aim is to understand natural language in order to
perform tasks such as question answering, language translation, and review analysis. Sentiment
analysis (SA) is contextual text mining that identifies and extracts subjective information from
source material [1]. This information helps improve a business by revealing the social sentiment
around a particular service, product, or brand. SA allows a brand to make use of unstructured data
and offers advantages such as real-time analysis; for instance, product review analysis can help
increase business [2].

Aspect-based sentiment classification (ABSC) is a fine-grained type of sentiment analysis used to
determine the sentiment (e.g. positive, negative, neutral) of an aspect term that is specifically
mentioned in the context [3], [4]. For example, in 'Laptop's processing speed is incredible!
However, the battery life is limited', the sentiment polarity of the aspect 'processing speed' is
positive, but that of 'battery life' is negative. ABSC addresses a limitation of sentence-level
sentiment classification: when a sentence has multiple aspects, the sentiment polarity may differ
from aspect to aspect. ABSC is split into two phases: aspect extraction [5], [6] and sentiment
classification [7], [8]. This research focuses solely on the sentiment classification phase, that
is, the ABSC task. In this field of study, most researchers have used machine learning and neural
network models. Natural language processing (NLP) features such as parts-of-speech and lexical
units are used to train traditional sentiment classification models [9]. For example, ABSC can be
performed using a support vector machine (SVM) with well-designed handcrafted features [10]. In
recent years, recurrent neural network models such as long short-term memory (LSTM) [11] and gated
recurrent units (GRU) [12] have been widely used in aspect-based sentiment classification [13],
[14]. However effective these methods are, these layered models process the words of a sentence
one step at a time, which makes training tedious. To address this, [15] presented a similar
technique based on convolutional neural networks (CNN). Even though CNNs are quite good at
lowering training time, they are incapable of capturing long-term dependencies in sentences.
Furthermore, aspect-level sentiment polarity is highly reliant on both the aspect and the review
context. To incorporate aspect information, several models used an attention mechanism [16], [17].
When an aspect contains multiple words, these approaches overlook the distinct relevance of the
words in the aspect phrase, resulting in information loss. In this study, we propose an efficient
approach, called the multi-head attention transformation network (MHAT), to resolve these
difficulties for ABSC tasks. MHAT first embeds context words and their associated part-of-speech
information in word embeddings and then generates contextualized word representations.
The primary contributions of this paper are as follows:

- We introduce a new model (MHAT) that analyses the words in a sentence in parallel using a
multi-head attention mechanism. MHAT is capable of accurately capturing the global
interdependence of the words in a sentence.

- We employ global vectors (GloVe) as the input embedding to amplify the effect of subsequent
tasks.


2. RELATED WORK 

Aspect-based sentiment classification has gained significant popularity in recent years. The bulk
of established approaches rely on classical classifiers (for example, SVM) that are highly
dependent on large-scale, well-crafted features such as bag-of-words and lexemes [9], [18], [19].
However, the outcomes produced by these methods depend substantially on the quality of the
features. For example, [20] accomplishes target-specific sentiment classification through a huge
number of instantaneous features. In [21], a supervised machine-learning approach for determining
the sentiment of aspect terms and aspect groups is described. These models, however, are heavily
reliant on the quality of their features. The authors of [22] proposed a framework that focuses on
adjectives that appear before or after aspect phrases inside a specified context window. The
authors of [23] construct a sentiment lexicon in which word polarities are based on topics or
domains using a hierarchical supervision topic model. The Stanford NLP tool is used to recognize
the part of speech of each word and to generate a syntax tree of the input sequences. Secondly,
certain studies depend on statistical methods. The authors of [24] employed supervised machine
learning models, including naive Bayes, k-nearest neighbor, decision trees, and support vector
machines (SVM), with syntactic, morphological, and semantic features [25].

To integrate aspect information, attention techniques were applied to sequential models such as
LSTM. Ma et al. [17] used an attention method to capture interactive information between the
context and the aspect. In [26], attention-based LSTM models were used for the classification
task, which improved model performance. These models learn aspect and context relationships on a
coarse scale, which results in information loss. To solve this, [27] proposed a sentiment
classification model that uses fine-grained attention for word-level learning and coarse-grained
attention to obtain the collective information of a sentence. GCSE [28], a CNN and gating
mechanism model, can recognize synchronization while learning. To improve the model's impact,
[29] used graph convolutional networks (GCN) to model long-term word dependencies, extract
grammatical features, and create a dependency tree structure.

The proposed approach is based on a transformer that incorporates multi-head attention layers to
gather information and relies on residual connections [30] as well as a normalization layer [31].
Here, the attention mechanism enables our system to process embedded sequences in parallel. It
relates every pair of words, allowing it to capture both word-level information and long-term
dependencies in an input sentence.


3. PROPOSED METHODOLOGY 
3.1. Task definition 

The structure of the proposed model MHAT is presented in this section. Figure 1 depicts the
overall architecture. We are given a review sentence $S = \{s_1, s_2, s_3, \ldots, s_n\}$ and the
aspect term $T = \{t_1, t_2, t_3, \ldots, t_n\}$. The task is to determine the sentiment type of
the aspect term in the review sentence $S$, which may be positive, negative, or neutral.

Figure 1. MHAT architecture


3.2. Embedding layer 

In this paper we use an efficient word embedding encoder, GloVe, which has been extensively
employed in neural networks for natural language processing problems [32]. We derive the word
embedding $X = \{x_1, x_2, x_3, \ldots, x_n\} \in \mathbb{R}^{n \times dim_x}$ and the POS
embedding of each word $P = \{p_1, p_2, p_3, \ldots, p_n\} \in \mathbb{R}^{n \times dim_p}$, where
$dim_x$ and $dim_p$ represent the dimensions of the word and part-of-speech embeddings,
respectively. To obtain the input representation
$W = \{w_1, w_2, w_3, \ldots, w_n\} \in \mathbb{R}^{n \times dim_w}$, we concatenate $X$ and $P$,
where $dim_w = dim_x + dim_p$. Then, the word embedding and the aspect term embedding are input
into the corresponding transformer encoder. Embeddings are especially instrumental for
aspect-based sentiment classification tasks and enhance the performance of the downstream task.
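
The concatenation of X and P described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vocabularies, the POS tag set, the random stand-in for the GloVe table, and the POS embedding size `dim_p = 36` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies. A real system would load pre-trained GloVe
# vectors and use the tag set of whichever POS tagger it relies on.
word_vocab = {"battery": 0, "life": 1, "is": 2, "limited": 3}
pos_vocab = {"NOUN": 0, "AUX": 1, "ADJ": 2}

dim_x = 300   # GloVe-300 word embedding size, as stated in the paper
dim_p = 36    # assumed POS embedding size (not given in the paper)

word_table = rng.normal(size=(len(word_vocab), dim_x))  # random stand-in for GloVe
pos_table = rng.normal(size=(len(pos_vocab), dim_p))    # stand-in for POS embeddings


def embed(tokens, tags):
    """Return W = [X ; P] with shape (n, dim_x + dim_p)."""
    X = word_table[[word_vocab[t] for t in tokens]]  # (n, dim_x)
    P = pos_table[[pos_vocab[t] for t in tags]]      # (n, dim_p)
    return np.concatenate([X, P], axis=-1)


W = embed(["battery", "life", "is", "limited"], ["NOUN", "NOUN", "AUX", "ADJ"])
print(W.shape)  # (4, 336)
```

Each row of `W` carries the word vector followed by the POS vector of the same token, so the encoder sees both lexical and grammatical information at every position.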


3.3. Position encoding 

Position encoding informs the model about the relative positions of the tokens. The positional
encoding vector is added to the embedding vector. Tokens with similar meaning lie closer together
in the d-dimensional space. The position encoding is calculated as (1) and (2).


$PE(pos, 2i) = \sin\left(\dfrac{pos}{10000^{2i/d}}\right)$ (1)

$PE(pos, 2i + 1) = \cos\left(\dfrac{pos}{10000^{2i/d}}\right)$ (2)


Here, $pos$ is the position of the token and $i$ indexes the dimension of the vector. The distance
between two tokens in the sentence is described by these sine and cosine functions.
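
Equations (1) and (2) can be sketched in a few lines. The function name `positional_encoding` is hypothetical, and the base of 10000 follows the standard transformer formulation, which is an assumption about the exact constant used here. The sketch also assumes an even embedding dimension.

```python
import numpy as np


def positional_encoding(n_positions, d_model, base=10000.0):
    """Sinusoidal position encoding; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / base ** (2 * i / d_model)   # pos / base^(2i/d)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions, as in (1)
    pe[:, 1::2] = np.cos(angles)               # odd dimensions, as in (2)
    return pe


pe = positional_encoding(n_positions=32, d_model=16)
# The encoding is added to the token embeddings of the same shape:
# inputs = embeddings + pe
```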


3.4. Transformer encoder 

Typically, a transformer encoder has two sub-layers: a multi-head attention mechanism and a fully
connected layer. Here, the multi-head attention mechanism (MHA) is used to capture the hidden
states of the inputs. Without employing sequence-aligned RNNs or convolution, it relies only on
self-attention to construct representations of its input and output.
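
As a rough illustration of such an encoder layer, the sketch below combines multi-head self-attention with a fully connected sub-layer, residual connections, and normalization. It is a simplified sketch under stated assumptions, not the MHAT implementation: it uses scaled dot-product scoring, random untrained weights, a ReLU feed-forward layer, and normalization without learnable parameters. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    # Normalization without learnable gain and bias, for brevity.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)


def init_params(d_model, d_ff, rng):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(scale=shape[0] ** -0.5, size=shape)
            for name, shape in shapes.items()}


def multi_head_attention(x, p, n_heads):
    n, d = x.shape
    d_head = d // n_heads  # d is assumed to be divisible by n_heads

    def split(w):
        # Linear projection, then reshape to (heads, tokens, d_head).
        return (x @ w).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(p["Wq"]), split(p["Wk"]), split(p["Wv"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
    concat = (weights @ v).transpose(1, 0, 2).reshape(n, d)        # concatenate heads
    return concat @ p["Wo"], weights


def encoder_layer(x, p, n_heads=8):
    attn, weights = multi_head_attention(x, p, n_heads)
    x = layer_norm(x + attn)                     # residual connection + normalization
    ff = np.maximum(0.0, x @ p["W1"]) @ p["W2"]  # fully connected sub-layer
    return layer_norm(x + ff), weights


rng = np.random.default_rng(0)
params = init_params(d_model=64, d_ff=128, rng=rng)
tokens = rng.normal(size=(5, 64))                # five embedded tokens
out, attn_weights = encoder_layer(tokens, params)
```

Because every token attends to every other token in a single matrix product, all positions are processed in parallel rather than step by step.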


3.4.1. Multi-head attention mechanism (MHA) 

Context-aspect word interaction is achieved through multi-head attention. Here, we represent a
sequence of task-related query vectors. Normally, a query $q$ and a set of key-value pairs are the
inputs to an attention function, where the keys are represented as
$k = \{k_1, k_2, k_3, \ldots, k_n\}$. In natural language processing tasks, key and value are
often the same, that is, key = value. In scaled dot-product attention, the weights are computed
from the dot products of the query $q$ and the keys. The corresponding attention distribution is
calculated as shown in (3).


$\mathrm{Attention}(k, q) = \mathrm{softmax}(s(k, q))$ (3)


Indonesian J Elec Eng & Comp Sci, Vol. 26, No. 1, April 2022: 472-481 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752


Here, $s$ is the semantic score, which measures the semantic relevance between a context word and
an aspect word. This scoring mechanism is defined in (4).


$s(k_i, q_j) = \tanh([k_i; q_j] \cdot W_a)$ (4)
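
A minimal sketch of the scoring function in (4) followed by the softmax in (3) is shown below. It assumes that the weight `w_a` is a vector of length twice the hidden dimension, so that each concatenated key-query pair yields one scalar score. That shape is an assumption made for illustration, and the function names are hypothetical.

```python
import numpy as np


def mlp_score(keys, queries, w_a):
    """s[j, i] = tanh([k_i ; q_j] . w_a) for every query j and key i."""
    n_k, d = keys.shape
    n_q = queries.shape[0]
    k_rep = np.broadcast_to(keys[None, :, :], (n_q, n_k, d))
    q_rep = np.broadcast_to(queries[:, None, :], (n_q, n_k, d))
    pairs = np.concatenate([k_rep, q_rep], axis=-1)  # (n_q, n_k, 2d)
    return np.tanh(pairs @ w_a)                      # (n_q, n_k)


def attention(keys, queries, w_a):
    """Softmax over the keys, so each query gets a distribution of attention."""
    s = mlp_score(keys, queries, w_a)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
context = rng.normal(size=(6, 4))  # six context words as keys
aspect = rng.normal(size=(2, 4))   # two aspect words as queries
weights = attention(context, aspect, rng.normal(size=8))
```

Each row of `weights` sums to one and indicates how strongly an aspect word attends to each context word.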


Here, MHA parallelizes the calculation over the input data and enables joint learning across
different representation subspaces. Parameters are not shared between the heads, and we use eight
heads, because the values of $k$ and $q$ change constantly. Each head acts as a parallel scaled
dot-product attention mechanism, and the information from different subspaces is learned using
linear projections. Finally, multi-head attention is computed as (5) and (6).


$head_i = \mathrm{Attention}(k, q)$ (5)

$\mathrm{MHA} = (head_1 \oplus head_2 \oplus head_3 \ldots \oplus head_h) \cdot W_o$ (6)


where $h \in [1, 8]$, $W_o \in \mathbb{R}^{dim_{hidden} \times dim_{hidden}}$ is the
corresponding weight matrix, and $dim_{hidden}$ is the hidden dimension. Attention alleviates the
vanishing gradient problem by giving a direct path to the inputs. Here, attention is provided by
adding a context vector to the previous block's output and hidden state. The context vector $C_i$
is given in (7).


$C_i = \sum_{j} a_{ij} h_j$ (7)


where $a_{ij}$ denotes how much attention the $i$-th output should pay to the $j$-th input, and is
defined by (8) and (9).


$a_{ij} = \mathrm{softmax}(e_{ij}) = \dfrac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$ (8)

$e_{ij} = f(S_{i-1}, h_j)$ (9)


Here the alignment model $f$ scores how well the inputs around position $j$ and the output at
position $i$ match, and $S_{i-1}$ is the hidden state derived in the preceding time step. The
alignment model can be computed in a basic dot-product, multiplicative, or additive way.
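
The steps in (7) to (9) can be sketched as follows, with a dot-product and an additive alignment model as two interchangeable choices of f. The function names and the weight shapes of the additive variant are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def dot_score(s_prev, h_j):
    """Basic dot-product alignment model."""
    return float(s_prev @ h_j)


def additive_score(s_prev, h_j, W_s, W_h, v):
    """Additive alignment model: v . tanh(W_s s_prev + W_h h_j)."""
    return float(v @ np.tanh(W_s @ s_prev + W_h @ h_j))


def context_vector(s_prev, H, score=dot_score):
    e = np.array([score(s_prev, h_j) for h_j in H])  # e_ij, as in (9)
    a = softmax(e)                                   # a_ij, as in (8)
    return a @ H, a                                  # C_i, as in (7)


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # five input hidden states h_j
s_prev = rng.normal(size=4)   # previous hidden state S_{i-1}

c_dot, a_dot = context_vector(s_prev, H)

W_s, W_h, v = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=3)
c_add, a_add = context_vector(
    s_prev, H, score=lambda s, h: additive_score(s, h, W_s, W_h, v))
```

In both cases the context vector is a weighted average of the input states, so every input contributes directly to the output.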


3.5. Model training and regularization 
The proposed model is trained end to end by minimizing the loss with $L_2$ regularization,
calculated as (10).


$\mathrm{Loss} = -\sum_{i} \hat{y}_i \log(y_i) + \lambda L_2(\theta)$ (10)


Here, $\hat{y}_i$ denotes the correct sentiment polarity, $y_i$ denotes the predicted sentiment
polarity for the given input sentence, and $i$ is the sentence index. $L_2$ is the regularization
term, $\lambda$ is its coefficient, and $\theta$ is the parameter set.
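
A minimal sketch of the loss in (10) is given below. It assumes one-hot targets, so that the sum over classes reduces to the log-probability of the correct class, and it takes the regularization term to be the sum of squared parameters. The function name, the coefficient value, and the small constant added inside the logarithm are assumptions.

```python
import numpy as np


def regularized_loss(probs, labels, params, lam=1e-4):
    """Cross-entropy over sentences plus lam times the sum of squared parameters.

    probs  : (n_sentences, n_classes) predicted class probabilities
    labels : (n_sentences,) index of the correct sentiment class
    params : iterable of parameter arrays (theta)
    """
    n = probs.shape[0]
    cross_entropy = -np.sum(np.log(probs[np.arange(n), labels] + 1e-12))
    l2 = sum(np.sum(p ** 2) for p in params)
    return cross_entropy + lam * l2


probs = np.array([[0.5, 0.25, 0.25]])  # positive, negative, neutral
labels = np.array([0])                 # the correct class is "positive"
theta = [np.array([1.0, 2.0])]
loss = regularized_loss(probs, labels, theta, lam=0.1)  # about log(2) + 0.1 * 5
```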


4. RESULTS AND DISCUSSION 
4.1. Dataset 

We evaluate MHAT on the Laptop14 dataset from "SemEval 2014 Task 4", which addresses the
aspect-based sentiment classification task [33], [34]. Table 1 shows the statistics of the
Laptop14 dataset. Figures 2-4 show a detailed analysis of the dataset.


Table 1. Dataset details

Dataset   Positive         Negative         Neutral
          Train    Test    Train    Test    Train    Test
Laptop    994      341     870      128     464      169

[Figure 2 panels: Training Dataset, Testing Dataset, Validation Dataset; categories: positive, negative, neutral]

Figure 2. Bar graph representing the distribution of samples for each sentiment category in the
training, testing, and validation datasets


[Figure 3 panels: Review Length Distribution, Aspect Length Distribution]

Figure 3. Character length of review and aspect for the training dataset


[Figure 4 panels: Training Word Count Distribution, Test Word Count Distribution, Validation Word Count Distribution]

Figure 4. Word count distribution in the training, testing, and validation datasets. The top row
is for review content and the bottom row is for aspect


4.1.1. Evaluation measure 
A common problem in all tasks is determining how to fairly evaluate the model's performance.
Different studies employ different evaluation metrics, which makes a horizontal comparison quite
difficult. We employ the currently common metric, accuracy, to compare the performance of all
models on ABSC. Accuracy measures how well a classification model performs. It is also known as
the classification percentage and is the ratio of the number of correct predictions to the total
number of input samples.


$\mathrm{Accuracy} = \dfrac{\mathrm{True}_{positive} + \mathrm{True}_{negative}}{\mathrm{Total}_{no\_of\_samples}}$ (11)
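
For the three-class setting used here, the ratio of correct predictions to the total number of samples can be computed as in the short sketch below. The function name and the label strings are hypothetical.

```python
def accuracy(predicted, gold):
    """Ratio of correct predictions to the total number of samples."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


predicted = ["positive", "negative", "neutral", "positive", "negative"]
gold = ["positive", "negative", "neutral", "negative", "negative"]
print(accuracy(predicted, gold))  # 0.8
```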


4.2. Baseline models 

In our experiments, the proposed model's performance is compared with that of other models. To
ensure a fair comparison, 300-dimensional GloVe word embeddings and a batch size of 64 are used
for all baseline methods. All the baseline methods are listed in Table 2.


Table 2. Comparison of MHAT with other baseline methods. Best results are in bold


Model              Laptop 2014 Accuracy (%)
ATAE-LSTM [13]     68.70
GCAE [15]          69.46
TD-LSTM [14]       71.48
IAN [18]           72.10
MemNet [16]        72.34
RAM [35]           74.51
MGAN [27]          75.39
Tnet-LF [8]        76.32
MAN [36]           78.13
MHAT               81.10


Accuracy was used as the evaluation metric for the proposed model. Table 2 shows the experimental
outcomes: the proposed model outperforms the baseline methods. In particular, its accuracy
improves by 2.97 percentage points over MAN (81.10% versus 78.13%), the current top model on this
dataset. This is because the multi-head attention mechanism determines the relationship between
every pair of words, capturing contextual information and long-term dependencies. In addition, we
have incorporated the POS features of words into our model, which may make it more successful at
learning the shifting significance of words.


4.3. Model training and analysis 

The proposed model is trained on the Laptop14 dataset with an embedding dimension of 300
(pre-trained GloVe-300 vectors), a maximum aspect length of 8, a maximum review content length of
32, and a maximum of 32,000 tokens. The model is trained for 30 epochs, and the training results
and model loss for each epoch are shown in Figure 5. The proposed model achieves a training
accuracy of about 93%, whereas the validation accuracy is about 73%. On the testing dataset, the
proposed model achieves an accuracy of 81.10%, as shown in Table 2.
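
The reported settings can be collected in a small configuration object, sketched below. The field names and the padding helper are hypothetical. The batch size of 64 is the value stated for the baseline comparison and is only assumed to apply to MHAT training as well. The layer and head counts come from Sections 3.4.1 and 4.4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    embedding_dim: int = 300   # pre-trained GloVe-300 vectors
    max_aspect_len: int = 8
    max_review_len: int = 32
    max_tokens: int = 32000
    epochs: int = 30
    batch_size: int = 64       # assumed; stated for the baseline comparison
    n_layers: int = 4
    n_heads: int = 8


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Bring a token id sequence to exactly max_len entries."""
    return list(token_ids[:max_len]) + [pad_id] * max(0, max_len - len(token_ids))


cfg = TrainConfig()
aspect_ids = pad_or_truncate([17, 4], cfg.max_aspect_len)
review_ids = pad_or_truncate(list(range(1, 41)), cfg.max_review_len)
```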


[Figure 5 panels: model accuracy, model loss; x-axis: epoch (0 to 30); curves: train, validation]

Figure 5. Training and validation accuracy and loss for each epoch


4.4. Model analysis

Here, we examine the impact of the MHAT modules, such as the stack of layers and the attention
modules. The transformer encoder includes four attention layers in total, which affect the
proposed model's performance. The stacked attention layers are used to manage complicated
sentiment connections in the input sequence. We evaluate the performance of the proposed model
with one to four attention layers. The transformer encoder has 4 layers with 8-head multi-head
attention, that is, 4 × 8 = 32 heads in total, so the model learns the relations between the input
tokens from 32 different perspectives. Figures 6-9 show the attention heat maps from the first to
the fourth layer of the network. We have chosen a random sentence from the Laptop14 dataset to
demonstrate how attention varies from layer to layer and across heads.
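
The 4 × 8 = 32 attention maps behind such heat maps can be collected as in the sketch below. It is an illustration only: the weights and the input vectors are random and untrained, the model dimension of 64 is assumed, and normalization and the feed-forward sub-layer are omitted, so the resulting maps carry no linguistic meaning.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v, n_heads):
    n, d = x.shape
    d_head = d // n_heads

    def heads(w):
        return (x @ w).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = heads(w_q), heads(w_k), heads(w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return out, weights


tokens = "Laptop battery is very good and keyboard is also smooth".split()
n_layers, n_heads, d_model = 4, 8, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(len(tokens), d_model))  # random stand-in for embeddings

maps = []
for _ in range(n_layers):
    w_q, w_k, w_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                     for _ in range(3))
    out, weights = self_attention(x, w_q, w_k, w_v, n_heads)
    x = x + out           # residual connection only
    maps.append(weights)  # one (heads, tokens, tokens) array per layer

maps = np.stack(maps)
print(maps.shape)  # (4, 8, 10, 10)
```

Each `maps[layer, head]` slice is a token-by-token matrix that could be rendered as one heat map panel.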


[Figures 6 to 9: attention heat maps, one panel per head (Head 1 to Head 8) for each layer]

Figure 6. Attention heat map of the first layer for all eight heads for the review sentence
"Laptop battery is very good and keyboard is also smooth"


Figure 7. Attention heat map of the second layer for all eight heads for the review sentence
"Laptop battery is very good and keyboard is also smooth"


Figure 8. Attention heat map of the third layer for all eight heads for the review sentence
"Laptop battery is very good and keyboard is also smooth"


Figure 9. Attention heat map of the fourth layer for all eight heads for the review sentence
"Laptop battery is very good and keyboard is also smooth"


5. CONCLUSION 


In this work, we created a distinctive approach for tackling the aspect-based sentiment
classification problem. We began by emphasizing the limitations of existing approaches to ABSC
through an in-depth analysis. We then developed the MHAT model, which uses an attention strategy
to express the context-aspect relationship in a task. As the initial embedding, we start with
pre-trained GloVe word vectors, which serve as the foundation for generating cutting-edge results
in the next layers. The multi-head attention technique obtains hidden representations in this
model. To increase the model's performance even further, we integrate POS features into the model,
which might be significant in capturing grammatical features of sentences, as well as providing
crucial information about words and their adjacent components. On the Laptop14 dataset, our model
is more accurate than the other state-of-the-art methods compared. ABSC is a complicated and
fine-grained task; hence there are still many unsolved challenges in this discipline. For example,
some opinion expressions play two roles: indicating sentiment and implying an (implicit) aspect
(target). This will be investigated further in our future study.